Method and apparatus for statistical text filtering

ABSTRACT

Disclosed herein is a method for automatically filtering a corpus of documents containing textual and non-textual information of a natural language. According to the method, in a first dividing step (101), the document corpus is divided into appropriate portions. In a subsequent determining step (105), a regularity value (V_(R)) is determined for each portion of the document corpus, measuring the conformity of the portion with respect to character sequence probabilities predetermined for the language considered. In a comparing step (107), each regularity value (V_(R)) is then compared with a threshold value (V_(T)) to decide whether the conformity is sufficient. Finally, in a rejecting step (111), any portion of the document corpus whose conformity is not sufficient is rejected and removed from the corpus. An apparatus for carrying out such a method is also disclosed.

CLAIM OF PRIORITY

[0001] This application claims the foreign priority benefits under 35 U.S.C. § 119 of European application No. 00480126.2 filed on Dec. 20, 2000, which is incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The invention relates in general to statistical language modeling. More particularly, the invention relates to a method for automatically filtering a corpus of documents containing textual and non-textual information of a natural language to be modeled, in order to obtain a corpus of documents that is well representative of the natural language. The invention also relates to an apparatus for carrying out such a method.

[0004] 2. Description of Related Art

[0005] Textual information is commonly formatted for the human eye, intermingled with non-textual information such as tables, graphics, etc. When such textual information needs to be processed by a machine (e.g. for delivery to a human through speech synthesis or for translation purposes), it becomes necessary to separate what really constitutes text (i.e. a succession of words and punctuation) from the non-textual information.

[0006] One such requirement applies to the elaboration of text corpora for statistical language modeling. Present statistical models used in Natural Language Processing (NLP) systems, such as speech recognition systems, require the analysis of large bodies of documents.

[0007] These documents, collectively referred to as a corpus, need to be as “true-to-life” as possible and are therefore collected from a wide variety of sources. As a consequence, together with the desired textual information (the “wheat”) in those corpora, there is usually a lot of non-exploitable data (the “chaff”), such as binary attachments, images, logos, headers, footers, tables, line drawings and so on.

[0008] Thus, prior to running a meaningful statistical analysis on such a corpus of documents, the corpus needs to be cleaned up so that only the “real” textual portions are kept.

[0009] Up to now, the above “cleaning” operation on a corpus of documents has commonly been performed manually, that is, each document is edited by a person on a display screen and the document is “filtered” upon visual inspection.

[0010] As a typical document corpus contains tens of millions of words, manual editing and filtering is extremely labor-intensive and costly. It can also be error-prone and potentially have dramatic consequences, e.g. if a corpus is damaged beyond repair by an over-enthusiastic use of the delete function.

[0011] In order to reduce the time necessary to achieve such a visual filtering of a corpus of documents, some software tools have been developed to assist people in performing this task. These software tools were designed to automate visual rules based on heuristics and “ad-hoc” observations.

[0012] Such rules are for instance: “Delete lines that contain less than 20% lowercase characters”, or “Delete lines that are more than 256 characters long”. Other rules were defined, based on visual inspection of the documents, such as: “Delete all the text that appears between two lines formed by ‘-------’” (when this is the way a table of numbers is presented in a given corpus).

[0013] All the above rules, even when they are implemented in a computer program, rely on visual inspection of the corpus and on human intervention. With such a “manual” filtering procedure, the cost of a sequence of filtering operations is commonly estimated to range, on average, from one to two man-weeks, depending on the corpus size and the number of different sources it encompasses.

[0014] Thus, as underlined above, given the great deal of time required by present corpus filtering methods, and the high risk of errors they imply as a consequence of human intervention, there is a real need for a corpus filtering method that improves on such an empirical method of filtering a large corpus of documents. This need is addressed by the invention disclosed herein.

SUMMARY OF THE INVENTION

[0015] A main object of the invention is therefore to provide an improved method for filtering a large corpus of documents, which remedies the aforementioned drawbacks of current filtering methods.

[0016] To this end, according to a first aspect, the invention concerns a method for automatically filtering a corpus of documents containing textual and non-textual information of a natural language. The method comprises the steps of:

[0017] dividing the corpus of documents into appropriate portions;

[0018] determining for each portion of the corpus of documents a regularity value measuring the conformity of the portion with respect to character sequence probabilities predetermined for the language considered;

[0019] comparing each regularity value with a threshold value to decide whether the conformity is sufficient; and

[0020] rejecting any portion of the corpus of documents whose conformity is not sufficient.

[0021] This new method, as implemented in a computer program, provides an efficient means for filtering a large corpus of documents quickly and in a way that is not error-prone.

[0022] According to a particularly advantageous characteristic of the invention, the predetermined character sequence probabilities are derived from a statistical model representative of the language.

[0023] In this way, the criteria used for rejecting or keeping a document portion accurately reflect the conformance or non-conformance of the document portion with regard to the rules of the language considered.

[0024] According to a preferred embodiment of the invention, the statistical model is previously elaborated from a reference document determined as conforming with the rules of the language under consideration.

[0025] According to a variant embodiment, the statistical model is initially used to filter a first segment, of a predetermined size, of the corpus of documents. The resulting first filtered segment then serves as a basis for computing a more accurate statistical model, which is to be used to filter the rest of the corpus of documents.

[0026] This iterative procedure provides the additional advantage that the latter model will tend to match the words and the format of the corpus in question much better than any “general-purpose model” could.

[0027] According to a second aspect, the invention concerns an apparatus for automatically filtering a corpus of documents containing textual and non-textual information of a natural language. The apparatus comprises:

[0028] means for dividing the corpus of documents into appropriate portions;

[0029] means for determining for each portion of the corpus of documents a regularity value measuring the conformity of the portion with respect to character sequence probabilities predetermined for said language;

[0030] means for comparing each regularity value with a threshold value to decide whether the conformity is sufficient; and

[0031] means for rejecting any portion of the document corpus whose conformity is not sufficient.

[0032] The invention also relates to a computer system comprising an apparatus as briefly defined above.

[0033] The invention also concerns a computer program comprising software code portions for performing a method as briefly defined above, when the computer program is loaded into and executed by a computer system.

[0034] The invention further concerns a computer program product stored on a computer usable medium. The computer program product comprises computer readable program means for causing a computer to perform an automatic document corpus filtering method as briefly defined above.

[0035] The advantages of this apparatus, this computer system, this computer program, and this computer program product are identical to those of the method as succinctly disclosed above. Other particularities and advantages of the invention will also emerge from the following description.

BRIEF DESCRIPTION OF THE DRAWINGS

[0036] In the accompanying drawings, given by way of non-limiting examples:

[0037] FIG. 1 is a flow chart illustrating the essential steps of a document corpus filtering method according to the invention;

[0038] FIG. 2 is a flow chart illustrating the process of elaborating a language model forming the basis for the determination of the regularity value of a given portion of the corpus of documents;

[0039] FIG. 3 is a functional block diagram of an apparatus for automatically filtering a corpus of documents, in conformity with the invention.

DETAILED DESCRIPTION OF THE INVENTION

[0040] The present invention aims to provide a method for automatically filtering a corpus of documents containing textual and non-textual information, in order to obtain a corpus of documents whose overall content can be considered as sufficiently representative of a natural language which is to be statistically analyzed.

[0041] The term “filtering” shall be construed as meaning the removal, from the collection of documents making up the corpus, of those portions which are not representative of the language under consideration, such as non-textual portions (e.g. graphics, tables) and textual portions expressed in another language.

[0042] With reference to FIG. 1, a description will be given of the corpus filtering method according to the invention. FIG. 1, which is a flow chart, depicts the essential steps of this corpus filtering method.

[0043] As shown in FIG. 1, the corpus filtering method according to the invention starts with a step 101 of dividing a corpus of documents denoted 10 (which is to be filtered) into appropriate portions. In step 101, the document corpus is divided into portions (e.g. lines, paragraphs or whole documents) whose size is determined as a function of the document corpus' overall size and/or as a function of the nature of the documents contained in the corpus. The size determined for the portions resulting from the division makes it possible to obtain the granularity desired for the filtering.

[0044] Each portion resulting from the dividing step will then be treated independently from the others, as will be explained further.

[0045] For example, a section of a non-cleaned corpus may resemble the following.

[0046] Example of an “unclean” corpus document:

            Average   85% of   Week
            Market    Market            Spli
Day         Price     Price    DOG      Adju
00/11/10     95.69    81.34    92.73    1
00/11/09     97.84    83.17    92.73    1
00/11/08    101.50    86.28    92.73    1
00/11/07    102.09    86.78    92.73    1
00/11/06    100.90    85.77    92.73    1
00/11/03    101.00    85.85    92.73    1

[0047] In the above example, the language under consideration for statistical analysis is the English language. As can be seen in this example document, the portion size suitable for the dividing step would be a paragraph, with a paragraph being defined as a set of characters that is isolated above and below by at least one blank line.
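As an illustration only, the paragraph-level division could be sketched in Python as follows. The function name `split_into_paragraphs` and the sample text are hypothetical, and a "blank line" is assumed here to mean a line containing nothing but spaces or tabs; the specification does not prescribe any particular implementation.

```python
import re
from typing import List


def split_into_paragraphs(text: str) -> List[str]:
    """Split text into paragraphs separated by one or more blank lines."""
    # Assumption: a blank line is a line holding only spaces or tabs.
    # The pattern consumes a newline followed by one or more blank lines,
    # so runs of several blank lines produce a single split point.
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    return [block for block in blocks if block.strip()]


# Hypothetical sample: text, a table row, then text again.
sample = "First paragraph,\nsecond line.\n\n00/11/10 95.69 81.34\n\n\nLast paragraph."
paragraphs = split_into_paragraphs(sample)
for index, paragraph in enumerate(paragraphs):
    print(index, repr(paragraph))
```

Each returned string can then be scored on its own, which matches the requirement that portions be treated independently of one another.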

[0048] When passed through the filter, with an appropriate model and threshold, the first paragraph (English text) would be retained, the second (stock price table) would be rejected as not being text, the third (English text again) would be retained, and the last one (French text) would also be rejected, since its letter sequences are “odd” with respect to the letter sequence expectations of the English language.

[0049] Returning to FIG. 1, after the unclean document corpus 10 has been divided into appropriately sized portions (step 101), according to the invention, a regularity value is determined for each portion, measuring the conformity of the portion with respect to character sequence probabilities predetermined for the language under consideration. To this end, step 103 is first entered to select one of the corpus portions (the current portion) resulting from the division of the corpus performed in step 101.

[0050] Then, in step 105, a regularity value denoted V_(R) is determined. As previously mentioned, this regularity value is intended to measure the conformity of the selected portion with respect to character sequence probabilities predetermined for the language under consideration.

[0051] According to a preferred implementation of the invention, the character sequence probabilities are derived from a statistical model (40) representative of the language considered. In this preferred implementation, the regularity value V_(R) is based on a computed perplexity of the portion with respect to the statistical model. Prior to the corpus filtering, the statistical model is elaborated from a reference document determined as conforming with the rules of the language. The process of computing such a language model, in accordance with a preferred implementation of the invention, will be detailed further below in connection with FIG. 2.

[0052] According to a preferred embodiment of the invention, the statistical model is a character-based N-gram model.

[0053] Language models such as character-based N-gram models are known in the art. In general terms, a language model, such as an N-gram model, tries to predict the a priori probability of an N-character-long string occurring in a given language. Theoretically, one would like to predict a new character from an infinitely long history of predecessor characters. Practically, however, these probabilities would be impossible to compute. A common approach is then to treat all histories that end in the same N-1 characters as the same state. Thus one assumes that the occurrence of a character C is completely determined by the past N-1 characters. Tri-gram models, for instance, use the two preceding characters to predict the current character. As some tri-grams may not be seen frequently enough to yield good predictions, the tri-gram model is often combined with lower-order models predicting the bi- and uni-gram probabilities.

[0054] According to a preferred implementation for the regularity value, it is suggested to compute the perplexity of the orthographic representation of a word with respect to a character-based N-gram model.

[0055] Perplexity is an information-theory measurement, expressed as a number. It is an indication of how many different letters are likely to follow a particular context of characters.

[0056] Informally, perplexity may be regarded as the average number of following characters that a character-based language model may have to choose from, given the present history of characters already looked at.

[0057] Formally, the perplexity is the reciprocal of the geometric average of the probabilities of a hypothesized string of characters.
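The formal definition can be illustrated with a short Python sketch. For per-character probabilities p_1 ... p_n, the perplexity is (p_1 · p_2 · ... · p_n)^(-1/n). The function name `perplexity` and the input values are illustrative assumptions; the probabilities would in practice come from the statistical model.

```python
import math
from typing import Sequence


def perplexity(char_probs: Sequence[float]) -> float:
    """Reciprocal of the geometric mean of per-character probabilities."""
    if not char_probs:
        raise ValueError("at least one probability is required")
    # Summing logarithms avoids numerical underflow on long portions,
    # where the raw product of probabilities would round to zero.
    mean_log = sum(math.log(p) for p in char_probs) / len(char_probs)
    return math.exp(-mean_log)


# If every character has probability 1/4, the model is in effect
# choosing among four equally likely characters at each position.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

This ties in with the informal reading: a uniform choice among k characters gives a perplexity of k.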

[0058] Returning to FIG. 1, once the regularity value has been determined (step 105) for the current portion of the corpus, step 107 is entered, in which the regularity value V_(R) is compared with a threshold value V_(T), in order to decide whether the conformity of the current portion with respect to the character sequence probabilities derived from the statistical model 40 is sufficient or not.

[0059] According to the invention, threshold value V_(T) is determined beforehand by first defining a test corpus as a subset of the document corpus to be filtered. Then a manual cleaning is performed on the test corpus so as to obtain a cleaned test corpus, which is representative of the type of textual information that is considered as being sufficiently in conformity with the language rules. After that, a perplexity value of said cleaned test corpus with regard to said statistical model is computed. Similarly, a perplexity value is computed for the rejected test corpus (i.e., the set of portions rejected from the initial test corpus). Finally, the threshold value sought is set between the two perplexity values obtained (for example as the average of these two perplexity values).
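A minimal sketch of the averaging option, in Python. The function name `midpoint_threshold` and the two perplexity values are hypothetical. The check that the cleaned text scores lower than the rejected text is an added assumption of this sketch, not a requirement stated in the specification.

```python
def midpoint_threshold(clean_perplexity: float, rejected_perplexity: float) -> float:
    """Place the threshold halfway between the two measured perplexities."""
    # Assumption of this sketch: well-formed text should be less
    # "surprising" to the model than the rejected material.
    if clean_perplexity >= rejected_perplexity:
        raise ValueError("cleaned corpus is expected to have the lower perplexity")
    return (clean_perplexity + rejected_perplexity) / 2.0


# Hypothetical perplexities measured on a manually cleaned test corpus
# and on the portions rejected from it.
threshold = midpoint_threshold(8.0, 40.0)
print(threshold)
```

Any other value between the two measurements would also satisfy the description; the midpoint is only the example the text gives.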

[0060] At step 109 in FIG. 1, if the conformity of the portion under consideration is determined as being sufficient, the portion is kept (step 113). Conversely, if the conformity is determined as being insufficient, the portion is rejected (step 111).

[0061] The following step 115 is a determination step, in which it is determined whether all portions of the document corpus have been processed. If not, a next portion is selected (step 103) and the preceding steps are performed again on the newly selected portion.

[0062] Otherwise, at the next step 117, all portions that have been kept, i.e. not rejected, are gathered in order to form a new corpus of documents 20, which is considered as “cleaned” or filtered. The resulting filtered corpus is then stored for further use.
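The loop of FIG. 1 can be sketched in Python as below. This is an illustration under stated assumptions: the regularity value is taken to be perplexity-like, so that a lower value means better conformity, and `symbol_fraction` is a deliberately crude stand-in for a real model-based score, used only to keep the example self-contained. All names are hypothetical.

```python
from typing import Callable, List, Tuple


def filter_corpus(
    portions: List[str],
    regularity: Callable[[str], float],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Return (kept, rejected) portions following the flow of FIG. 1."""
    kept: List[str] = []
    rejected: List[str] = []
    for portion in portions:          # step 103: select the current portion
        value = regularity(portion)   # step 105: determine V_R
        if value <= threshold:        # steps 107/109: compare with V_T
            kept.append(portion)      # step 113: keep
        else:
            rejected.append(portion)  # step 111: reject
    return kept, rejected             # step 117: kept portions form the new corpus


def symbol_fraction(portion: str) -> float:
    """Stand-in score (NOT a perplexity): share of non-letter, non-space characters."""
    if not portion:
        return 1.0
    odd = sum(1 for ch in portion if not (ch.isalpha() or ch.isspace()))
    return odd / len(portion)


kept, rejected = filter_corpus(
    ["The quick brown fox.", "00/11/10 95.69 81.34"],
    symbol_fraction,
    threshold=0.5,
)
print(kept)
print(rejected)
```

In a real setting the `regularity` argument would be the perplexity of the portion under the character N-gram model.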

[0063] Now, with reference to FIG. 2, there will be described the process of elaborating a language model forming the basis for the determination of the regularity value of a given portion of the corpus of documents, in accordance with a preferred implementation of the invention.

[0064] The process starts (step 201) by collecting a corpus of textual data deemed to follow the “regularity” that is to be modeled, both in content (types of words) and in form (punctuation, line breaks, special characters, etc.). The collection of textual data obtained is then manually cleaned (step 203) to keep only pertinent textual data (e.g., graphics and text in other languages are removed). A clean training corpus 30 is thereby obtained and stored.

[0065] At the following step 205, the clean training corpus 30 is subdivided into training data 33 and held-out data 35, by randomly selecting a certain percentage of the corpus (e.g. 10%). As will be described hereafter, training data 33 will actually serve as a basis to compute the N-gram statistics upon which the statistical model will be determined. On the other hand, held-out data 35 will be used to optimize the statistical model computed from the training data.
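One way the random subdivision might look in Python is sketched below. It assumes that the randomly selected percentage becomes the held-out data and that the split is made at the level of portions; the specification fixes neither point. The function name, the fixed seed, and the sample data are all hypothetical.

```python
import random
from typing import List, Tuple


def split_training_corpus(
    portions: List[str],
    held_out_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Randomly set aside a fraction of the portions as held-out data."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(portions)))
    rng.shuffle(indices)
    # Keep at least one held-out portion when the corpus is non-empty.
    n_held_out = max(1, round(len(portions) * held_out_fraction))
    held_out_indices = set(indices[:n_held_out])
    training = [p for i, p in enumerate(portions) if i not in held_out_indices]
    held_out = [p for i, p in enumerate(portions) if i in held_out_indices]
    return training, held_out


corpus = [f"line {i}" for i in range(50)]
training, held_out = split_training_corpus(corpus)
print(len(training), len(held_out))
```

Both returned lists preserve the original order of the portions, which keeps the training text readable if it is later inspected by hand.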

[0066] As shown at step 207 of FIG. 2, training data 33 is used to compute 1-gram, 2-gram and 3-gram models. The models are computed by counting uni-letter frequencies, bi-letter frequencies, and tri-letter frequencies. The frequencies obtained are then used as approximations of the probability of such letter sequences. The construction and functioning of such N-gram models is known within the state of the art. The overall likelihood of a sequence of 3 letters is computed as a linear combination of the uni-letter, bi-letter and tri-letter likelihoods, with an added offset to give non-zero probabilities to never-observed letter sequences.
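The counting and linear combination can be sketched in Python as follows. This is not the patented implementation: the class name, the interpolation weights (0.1, 0.3, 0.6), and the offset of 1e-6 are placeholder assumptions, and the resulting values are not renormalized to sum exactly to one.

```python
import math
from collections import Counter
from typing import Tuple


class InterpolatedTrigramModel:
    """Character model mixing uni-, bi- and tri-letter relative frequencies."""

    def __init__(self, text: str,
                 weights: Tuple[float, float, float] = (0.1, 0.3, 0.6),
                 offset: float = 1e-6) -> None:
        self.w1, self.w2, self.w3 = weights  # assumed values, not from the source
        self.offset = offset                 # keeps unseen sequences above zero
        self.total = len(text)
        self.uni = Counter(text)
        self.bi = Counter(text[i:i + 2] for i in range(len(text) - 1))
        self.tri = Counter(text[i:i + 3] for i in range(len(text) - 2))

    def prob(self, char: str, history: str) -> float:
        """Estimate P(char | last two characters of history)."""
        h1, h2 = history[-1:], history[-2:]
        p1 = self.uni[char] / self.total if self.total else 0.0
        p2 = self.bi[h1 + char] / self.uni[h1] if h1 and self.uni[h1] else 0.0
        p3 = self.tri[h2 + char] / self.bi[h2] if len(h2) == 2 and self.bi[h2] else 0.0
        return self.offset + self.w1 * p1 + self.w2 * p2 + self.w3 * p3

    def perplexity(self, portion: str) -> float:
        """Reciprocal geometric mean of the per-character probabilities."""
        if not portion:
            raise ValueError("portion must not be empty")
        log_sum = sum(math.log(self.prob(ch, portion[max(0, i - 2):i]))
                      for i, ch in enumerate(portion))
        return math.exp(-log_sum / len(portion))


model = InterpolatedTrigramModel("the cat sat on the mat. the hat sat on the cat. ")
print(model.perplexity("the cat sat"))
print(model.perplexity("95.69 81.34"))
```

With this toy training text, the numeric string scores a far higher perplexity than the English phrase, because most of its characters were never observed and fall back on the offset.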

[0067] At step 209, the coefficients of the linear combination can be estimated using the held-out data 35 in order to optimize the performance of the statistical model. A state-of-the-art approach for this process can be found in the teaching of F. Jelinek and R. L. Mercer, “Interpolated Estimation of Markov Source Parameters from Sparse Data” in Proc. of the Workshop on Pattern Recognition in Practice, North-Holland Publishing Company, 1980.

[0068] Lastly, at step 211, the final statistical model 40 is generated and stored.

[0069] According to a preferred implementation of the invention, control/formatting characters such as “tab”, “space”, “new line” are included in the alphabet of the language to model, in order to not only model the probable letter sequences contained in the language words of a document, but also model the form of the document content.

[0070] In accordance with a variant implementation of the invention, in order to improve the accuracy of the corpus filtering, the statistical model is initially used to filter a first corpus segment of a predetermined size to provide a first filtered segment of the document corpus. Then, the first filtered segment serves as a basis for computing a more accurate statistical model, which is to be used to filter the rest of the corpus of documents.
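The two-pass variant might be organized as in the Python sketch below. The scoring functions are passed in as parameters so that the example stays self-contained; `symbol_fraction` and `train_unseen_fraction` are toy stand-ins for a real initial model and a real retraining step. The use of separate thresholds for the two passes is an assumption of this sketch, made because a threshold tuned for one model need not suit another.

```python
from typing import Callable, List

Scorer = Callable[[str], float]


def bootstrap_filter(
    portions: List[str],
    initial_score: Scorer,
    train_scorer: Callable[[List[str]], Scorer],
    initial_threshold: float,
    refined_threshold: float,
    first_segment_size: int,
) -> List[str]:
    """Filter a first segment, retrain on what survives, then filter the rest."""
    first = portions[:first_segment_size]
    rest = portions[first_segment_size:]
    first_kept = [p for p in first if initial_score(p) <= initial_threshold]
    refined_score = train_scorer(first_kept)  # the "more accurate" model
    rest_kept = [p for p in rest if refined_score(p) <= refined_threshold]
    return first_kept + rest_kept


def symbol_fraction(portion: str) -> float:
    """Toy initial score: share of non-letter, non-space characters."""
    odd = sum(1 for ch in portion if not (ch.isalpha() or ch.isspace()))
    return odd / len(portion) if portion else 1.0


def train_unseen_fraction(training: List[str]) -> Scorer:
    """Toy retraining: score a portion by the share of characters never seen."""
    seen = set("".join(training))

    def score(portion: str) -> float:
        unseen = sum(1 for ch in portion if ch not in seen)
        return unseen / len(portion) if portion else 1.0

    return score


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "00/11/10 95.69 81.34",
    "a second line of plain text",
    "101.50 86.28 92.73",
]
cleaned = bootstrap_filter(corpus, symbol_fraction, train_unseen_fraction,
                           initial_threshold=0.5, refined_threshold=0.5,
                           first_segment_size=2)
print(cleaned)
```

In practice `train_scorer` would rebuild the character N-gram model from the first filtered segment and return its perplexity function.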

[0071] Now, in relation to FIG. 3, there will be described an apparatus for automatically filtering a corpus of documents, in conformity with the invention.

[0072] The apparatus (3) depicted in FIG. 3 includes software and hardware components. In a preferred embodiment, the filtering method of the invention is implemented through a computer program, which is to be run in a computer system, for example a microcomputer, in order to carry out the filtering method.

[0073] Apparatus 3 comprises a corpus storing unit 301 in which documents forming the corpus can be stored. For example, the storing unit 301 may comprise a hard disk drive, or a Compact Disk (CD) drive. Apparatus 3 includes a corpus input/output unit 303, which is responsible for retrieving from storing unit 301 the documents that are to be processed, i.e. filtered, and for storing the documents into storing unit 301 once filtered.

[0074] Filtering apparatus 3 also includes a corpus dividing unit 307 intended for dividing the document corpus into appropriate portions as described above in connection with FIG. 1.

[0075] Still within filtering apparatus 3, a regularity computation unit 305 is responsible for determining for each portion of the document corpus a regularity value measuring the conformity of the portion with respect to character sequence probabilities predetermined for the language considered.

[0076] A conformity determination unit 309 is then responsible for comparing each regularity value with a threshold value, predetermined as explained supra, to decide whether the conformity is sufficient or not.

[0077] Conformity determination unit 309 also handles the task of rejecting any portion of the document whose conformity is determined as being insufficient.

[0078] Lastly, a corpus gathering unit 311 makes it possible to gather all the document portions that have not been rejected by the conformity determination unit 309, so as to form the cleaned corpus. The cleaned corpus is then stored into the corpus storing unit 301.

[0079] Finally, the filtering apparatus 3 has a control unit 313, for controlling the overall functioning of the apparatus. In particular, control unit 313 is responsible for determining the sequencing of the operations performed by the other units, and for assuring the transfer of the working data from one unit to another.

[0080] In summary, there have been disclosed herein a method and an apparatus for automatically filtering a corpus of documents containing textual and non-textual information of a natural language. According to the method, through a first dividing step, the document corpus is divided into appropriate portions. At a following determining step, for each portion of the document corpus, there is determined a regularity value measuring the conformity of the portion with respect to character sequence probabilities predetermined for the language considered. At a comparing step, each regularity value is then compared with a threshold value to decide whether the conformity is sufficient. Finally, at a rejecting step, any portion of the document corpus whose conformity is not sufficient is rejected and removed from the corpus.

[0081] One advantage of this method is that it allows the automatic determination of the regularity of portions of textual data with regard to specific language rules. The filtering process according to the invention is implemented as a computer program, which runs in a matter of minutes, as opposed to the weeks of skilled labor required by prior manual methods. Depending on the conditions (initial cleanliness of the corpus, size, etc.), the threshold value (V_(T)), used to decide whether the conformity of a current portion is sufficient or not, can be adjusted to balance the false rejections (i.e., clean text labeled as noise) with respect to the false acceptances (i.e., non-textual portions not flagged as such).
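The trade-off controlled by the threshold can be made concrete with a small Python sketch. The scored portions are invented purely for illustration: each pair holds a regularity value and a manual label saying whether the portion is clean text. The function name `error_counts` is hypothetical, and lower values are assumed to mean better conformity.

```python
from typing import List, Tuple


def error_counts(scored: List[Tuple[float, bool]], threshold: float) -> Tuple[int, int]:
    """Return (false_rejections, false_acceptances) for a given threshold."""
    # False rejection: clean text whose value exceeds the threshold.
    false_rejections = sum(1 for value, is_clean in scored
                           if is_clean and value > threshold)
    # False acceptance: non-text whose value stays at or below the threshold.
    false_acceptances = sum(1 for value, is_clean in scored
                            if not is_clean and value <= threshold)
    return false_rejections, false_acceptances


# Hypothetical (regularity value, is_clean_text) pairs.
scored = [(5.0, True), (9.0, True), (14.0, True),
          (12.0, False), (30.0, False), (45.0, False)]

for candidate in (10.0, 20.0):
    print(candidate, error_counts(scored, candidate))
```

With these invented values, a strict threshold of 10.0 rejects one clean portion and accepts no noise, while a looser threshold of 20.0 does the opposite.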

[0082] Persons skilled in the art will recognize that many variations of the teachings of this invention can be practiced that still fall within the claims of this invention, which follow.

1. A method for automatically filtering a corpus of documents containing textual and non-textual information of a natural language, the method being characterized in that it comprises the steps of: dividing the corpus of documents into appropriate portions; determining for each portion of the corpus of documents a regularity value (V_(R)) measuring the conformity of the portion with respect to character sequences probabilities predetermined for said language; comparing each regularity value with a threshold value (V_(T)) to decide whether the conformity is sufficient; and rejecting any portion of the corpus of documents whose conformity is not sufficient.
2. Method according to claim 1, wherein said character sequences probabilities are derived from a statistical model representative of said language.
3. Method according to claim 2, wherein for each portion of the corpus of documents, said regularity value (V_(R)) is based on a computed perplexity of the portion with respect to said statistical model.
4. Method according to claim 2, wherein said statistical model is previously elaborated from a reference document determined as conforming with the rules of said language.
5. Method according to claim 2, wherein said statistical model is being determined according to N-gram statistics.
6. Method according to claim 2, wherein said statistical model is a character-based N-gram model.
7. Method according to claim 2, wherein said statistical model is initially used to filter a first corpus segment of a predetermined size to provide a first filtered segment of the corpus of documents, said first filtered segment serving as a basis for computing a more accurate statistical model which is to be used to filter the rest of the corpus of documents.
8. Method according to claim 1, wherein said threshold value (V_(T)) is determined by executing the following steps of: defining a test corpus as a subset of the corpus of documents to be filtered; manually cleaning said test corpus so as to obtain a cleaned test corpus which is representative of the type of textual information that is considered as being sufficiently in conformity with the language rules and a rejected test corpus that is the complement of said cleaned test corpus; computing a perplexity value for each of said cleaned and rejected test corpora with regard to said statistical model; and setting the threshold value searched between the perplexity values computed.
9. Method according to claim 1, wherein said portions comprise lines, paragraphs, and whole documents, whose size is determined as a function of the overall size of the corpus of documents or as a function of the nature of the documents contained in the corpus of documents or both, so as to obtain a granularity desired for the filtering.
10. An apparatus for automatically filtering a corpus of documents containing textual and non-textual information of a natural language, the apparatus being characterized in that it comprises: means for dividing the corpus of documents into appropriate portions; means for determining for each portion of the corpus of documents a regularity value measuring the conformity of the portion with respect to character sequences probabilities predetermined for said language; means for comparing each regularity value with a threshold value to decide whether the conformity is sufficient; and means for rejecting any portion of the corpus of documents whose conformity is not sufficient.
11. Apparatus according to claim 10, wherein said character sequences probabilities are derived from a statistical model representative of said language.
12. Apparatus according to claim 11, wherein for each portion of the corpus of documents, said regularity value (V_(R)) is based on a computed perplexity of the portion with respect to said statistical model.
13. Apparatus according to claim 11, wherein said statistical model is previously elaborated from a reference document determined as conforming with the rules of said language.
14. Apparatus according to claim 11, wherein said statistical model is being determined according to N-gram statistics.
15. Apparatus according to claim 11, wherein said statistical model is a character-based N-gram model.
16. Apparatus according to claim 11, wherein said statistical model is initially used to filter a first corpus segment of a predetermined size to provide a first filtered segment of the corpus of documents, said first filtered segment serving as a basis for computing a more accurate statistical model which is to be used to filter the rest of the corpus of documents.
17. Apparatus according to claim 10, wherein said threshold value (V_(T)) is determined by executing the following steps of: defining a test corpus as a subset of the corpus of documents to be filtered; manually cleaning said test corpus so as to obtain a cleaned test corpus which is representative of the type of textual information that is considered as being sufficiently in conformity with the language rules and a rejected test corpus that is the complement of said cleaned test corpus; computing a perplexity value for each of said cleaned and rejected test corpora with regard to said statistical model; and setting the threshold value searched between the perplexity values computed.
18. Apparatus according to claim 10, wherein said portions comprise lines, paragraphs, and whole documents, whose size is determined as a function of the overall size of the corpus of documents or as a function of the nature of the documents contained in the corpus of documents or both, so as to obtain a granularity desired for the filtering.
19. A computer system comprising an apparatus according to claim 10.
20. A computer program comprising software code portions for performing a method according to claim 1, when said computer program is loaded and executed by a computer system.
21. A computer-readable program storage medium which stores a program for executing a method for automatically filtering a corpus of documents containing textual and non-textual information of a natural language, the method being characterized in that it comprises the steps of: dividing the corpus of documents into appropriate portions; determining for each portion of the corpus of documents a regularity value (V_(R)) measuring the conformity of the portion with respect to character sequences probabilities predetermined for said language; comparing each regularity value with a threshold value (V_(T)) to decide whether the conformity is sufficient; and rejecting any portion of the corpus of documents whose conformity is not sufficient.
22. Computer-readable program storage medium according to claim 21, wherein said character sequences probabilities are derived from a statistical model representative of said language.
23. Computer-readable program storage medium according to claim 22, wherein for each portion of the corpus of documents, said regularity value (V_(R)) is based on a computed perplexity of the portion with respect to said statistical model.
24. Computer-readable program storage medium according to claim 22, wherein said statistical model is previously elaborated from a reference document determined as conforming with the rules of said language.
25. Computer-readable program storage medium according to claim 22, wherein said statistical model is being determined according to N-gram statistics.
26. Computer-readable program storage medium according to claim 22, wherein said statistical model is a character-based N-gram model.
27. Computer-readable program storage medium according to claim 22, wherein said statistical model is initially used to filter a first corpus segment of a predetermined size to provide a first filtered segment of the corpus of documents, said first filtered segment serving as a basis for computing a more accurate statistical model which is to be used to filter the rest of the corpus of documents.
28. Computer-readable program storage medium according to claim 21, wherein said threshold value (V_(T)) is determined by executing the following steps of: defining a test corpus as a subset of the corpus of documents to be filtered; manually cleaning said test corpus so as to obtain a cleaned test corpus which is representative of the type of textual information that is considered as being sufficiently in conformity with the language rules and a rejected test corpus that is the complement of said cleaned test corpus; computing a perplexity value for each of said cleaned and rejected test corpora with regard to said statistical model; and setting the threshold value searched between the perplexity values computed.
29. Computer-readable program storage medium according to claim 21, wherein said portions comprise lines, paragraphs, and whole documents, whose size is determined as a function of the overall size of the corpus of documents or as a function of the nature of the documents contained in the corpus of documents or both, so as to obtain a granularity desired for the filtering.